GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 3 - Classification/Decision Tree/[R] Decision Tree.ipynb
Kernel: R

Decision Tree

Data preprocessing

# Importing the dataset
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
head(dataset, 10)
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.80)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
head(training_set, 10)
head(test_set, 10)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
head(training_set, 10)
head(test_set, 10)

Fitting Decision Tree classifier to the Training set

library(rpart)
classifier = rpart(formula = Purchased ~ ., data = training_set)

Predicting the Test set results

y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
head(y_pred, 10)
head(test_set[3], 10)

Making the Confusion Matrix

cm = table(test_set[, 3], y_pred)
cm
   y_pred
     0  1
  0 43  8
  1  6 23

The classifier made 43 + 23 = 66 correct predictions and 8 + 6 = 14 incorrect predictions.
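The same numbers can be recovered directly from the confusion matrix; a minimal sketch, using only the cm table built above:

# Correct predictions sit on the diagonal of the confusion matrix
correct = sum(diag(cm))         # 43 + 23 = 66
incorrect = sum(cm) - correct   # 8 + 6 = 14
accuracy = correct / sum(cm)    # 66 / 80 = 0.825
accuracy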


Visualising the Training set results

library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3],
     main = 'Decision Tree (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'), col = 'white')
legend("topright", legend = c("0", "1"), pch = 16, col = c('red3', 'green4'))
[Plot: Decision Tree (Training set) — decision regions over Age and Estimated Salary]

Visualising the Test set results

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3],
     main = 'Decision Tree (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'), col = 'white')
legend("topright", legend = c("0", "1"), pch = 16, col = c('red3', 'green4'))
[Plot: Decision Tree (Test set) — decision regions over Age and Estimated Salary]

Things to remember while building a decision tree classifier:

  • Decision trees normally overfit the data. In R, however, the 'rpart' library (a powerful and widely used package) limits tree growth by default, so overfitting is drastically reduced compared to the decision tree model in Python. If you want to tune this behaviour yourself, see the rpart.control sketch after this list.

  • There is no need to scale the features, because a decision tree does not depend on Euclidean distance. We apply feature scaling here only to get a plot with better resolution: if you omit scaling, the visualisation grid in the example above would take well over 200 GB of memory, which makes plotting impossible (a rough calculation follows this list).
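If the tree does start to overfit, its growth can be limited explicitly through rpart.control. The parameter values below are purely illustrative and are not the ones used in this notebook:

# Illustrative only: tighten the complexity parameter, minimum split size and depth
# to prune the tree more aggressively than rpart's defaults
classifier_pruned = rpart(formula = Purchased ~ .,
                          data = training_set,
                          control = rpart.control(cp = 0.02, minsplit = 30, maxdepth = 4))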


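To see why the unscaled grid blows up, here is a rough back-of-the-envelope calculation; the Age and EstimatedSalary ranges below are approximate for this dataset:

# Approximate size of the visualisation grid at a step of 0.01 without feature scaling
n_age    = (60 - 18) / 0.01          # about 4,200 grid values for Age
n_salary = (150000 - 15000) / 0.01   # about 13.5 million grid values for EstimatedSalary
n_points = n_age * n_salary          # tens of billions of grid points
n_points * 2 * 8 / 1e9               # two numeric columns at 8 bytes each: hundreds of GB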

If we don't apply feature scaling (i.e. do not run cells 7 through 9), we can take a look at the decision tree itself. This only takes two lines of code:


Plotting the decision tree

plot(classifier)
text(classifier)
[Plot: decision tree structure with split rules at each node]
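As an alternative (not used in this notebook), the rpart.plot package draws a more readable tree, assuming it is installed:

# Alternative rendering of the same tree (requires the rpart.plot package)
# install.packages('rpart.plot')
library(rpart.plot)
rpart.plot(classifier)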